Assembly collapsing versus heterozygosity oversizing: detection of homokaryotic and heterokaryotic Laccaria trichodermophora strains by hybrid genome assembly

Abstract Genome assembly and annotation using paired short reads is challenging for eukaryotic organisms because of their large genome size, variable ploidy and large number of repetitive elements. Single-molecule long reads improve assembly quality (completeness and contiguity), but haplotype duplications still pose assembly challenges. To address the effect of read length on genome assembly quality, gene prediction and annotation, we compared genome assemblers and sequencing technologies using four strains of the ectomycorrhizal fungus Laccaria trichodermophora. By analysing the predicted repertoire of carbohydrate-active enzymes, we investigated the effects of assembly quality on functional inferences. Libraries were generated using three different sequencing platforms (Illumina Next-Seq, Mi-Seq and PacBio Sequel), and genomes were assembled from single libraries or with hybrid approaches. Long-read or hybrid assembly resolved the collapsing of repeated regions, but the heterozygous nuclear versions remained unresolved. In dikaryotic fungi, each cell contains two nuclei, and the nuclei differ not only in allelic gene versions but also in gene composition and synteny. These heterokaryotic cells cause fragmentation and size overestimation in the genome assembly of each nucleus. Hybrid assembly revealed a wider functional diversity of genomes. Here, several predicted oxidizing activities on glycosyl residues of oligosaccharides and several chitooligosaccharide acetylase activities would have gone unnoticed in short-read assemblies. In addition, the size and fragmentation of the genome assembly, in combination with heterozygosity analysis, allowed us to distinguish homokaryotic and heterokaryotic strains isolated from L. trichodermophora fruit bodies.

Total length 29,141,152 33,409,876 36,644,313 39,142,984 41,079,889 41,947

All values consider only contigs/scaffolds over 1 kb. The Fly, Canu and Canu smash assemblies were made using long reads only, while the remaining five used a hybrid approach: the Pilon assemblies were built from the Canu results, and the PurgeHap assemblies from the Pilon results.
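As an illustration of how a length-filtered summary such as the "Total length" row can be derived, the following minimal sketch computes the contig count, total length and N50 from a list of contig lengths. The function name `assembly_stats`, the inclusive 1,000-base threshold and the example lengths are assumptions made for illustration; this is not the authors' pipeline, and the exact cutoff convention used for the table is not stated in the text.

```python
def assembly_stats(contig_lengths, min_len=1000):
    """Summarise an assembly from contig lengths (hypothetical helper).

    Assumption: contigs shorter than `min_len` bases are discarded,
    with the threshold treated as inclusive.
    """
    # Keep only contigs at or above the cutoff, longest first.
    kept = sorted((n for n in contig_lengths if n >= min_len), reverse=True)
    total = sum(kept)

    # N50: length of the contig at which the running sum of the
    # length-sorted contigs first reaches half of the total length.
    running = 0
    n50 = 0
    for n in kept:
        running += n
        if running * 2 >= total:
            n50 = n
            break

    return {"contigs": len(kept), "total_length": total, "n50": n50}


if __name__ == "__main__":
    # Made-up lengths, not data from the study.
    print(assembly_stats([500, 1000, 2000, 3000, 4000]))
```

A larger total length combined with a lower N50 for the same strain is the kind of pattern the abstract associates with heterokaryotic samples, although such a summary alone does not establish it.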

Figure S2. Quality control of raw and clean Next-Seq (76 bp) paired-end reads of the CA15-11 strain.

Figure S3. Quality control of raw and clean Mi-Seq (300 bp) paired-end reads of the CA15-11 strain.

Figure S4. Quality control of raw and clean Next-Seq (76 bp) paired-end reads of the CA15-75 strain.

Figure S5. Quality control of raw and clean Mi-Seq (300 bp) paired-end reads of the CA15-75 strain.

Figure S6. Quality control of raw and clean Next-Seq (76 bp) paired-end reads of the CA15-F10 strain.

Figure S7. Quality control of raw and clean Mi-Seq (300 bp) paired-end reads of the CA15-F10 strain.

Figure S8. Quality control of raw and clean Next-Seq (76 bp) paired-end reads of the EF-36 strain.

Figure S9. Quality control of raw and clean Mi-Seq (300 bp) paired-end reads of the EF-36 strain.

Figure S10. General workflow for the whole de novo Laccaria trichodermophora genome assembly. A) Short-read assemblies. B) Long-read or hybrid assemblies. Lilac boxes represent raw, trimmed or corrected sequencing data; light blue boxes represent assemblers; yellow boxes represent complementary software such as cleaning, mapping or polishing tools; green boxes represent the resulting assemblies.

Figure S12. Icarus alignment viewer of L. trichodermophora CA15-11 assemblies, generated with QUAST using the CA15-11 hybrid (Canu + Pilon) assembly as reference. A) A fragment of the alignment shown in B. B) The first 177 scaffolds.

Figure S13. Icarus alignment viewer of L. trichodermophora EF-36 assemblies, generated with QUAST using the EF-36 hybrid (Canu + Pilon) assembly as reference. A) A fragment of the alignment shown in B. B) The first 31 scaffolds.

Figure S18. Collinearity analysis (MUMmer) between the contigs of two L. trichodermophora strains (CA15-11 and EF-36) containing the genes present in the MAT-A region of L. bicolor. The contigs of the CA15-11 strain align twice against those of the EF-36 strain, except for the region from 40,000 to 60,000 of EF-36. Red colors represent greater nucleotide identity than blue colors.

Figure S19. CAZyme gene gains and losses in four different strains/assemblies.

Figure S20. Heatmap comparing the predicted CAZy gene content. The row dendrogram clusters together enzymes with similar mean copy numbers.

Table S3. Comparison between Laccaria trichodermophora EF-36 SPAdes assemblies generated with different sets of k-mer sizes.

Table S5. Comparison of the best results from different short-read genome assemblers for Laccaria trichodermophora EF-36.

Table S6. Comparison of long-read and hybrid assemblies.